11/21/2020
# Load the packages this chunk relies on
library(Lahman)  # provides the Pitching dataset
library(dplyr)   # provides select(), mutate(), tibble() and %>%

# Here, we look at the relevant data from Lahman
# Select columns from the Pitching dataset
pitchers <- select(tibble(Pitching), playerID, yearID, teamID, IPouts, BB, SO, BAOpp, ERA, W, L)

# Create a Net Wins column
pitchers <- pitchers %>% mutate(NetWins = W - L)

# Only keep rows where there is no missing data
pitchers <- pitchers[complete.cases(pitchers), ]

# Normalize data so that coefficients are meaningful
pitchers <- pitchers %>% mutate(normIPouts = (IPouts - mean(IPouts)) / sd(IPouts))
pitchers <- pitchers %>% mutate(normBB = (BB - mean(BB)) / sd(BB))
pitchers <- pitchers %>% mutate(normSO = (SO - mean(SO)) / sd(SO))
pitchers <- pitchers %>% mutate(normBAOpp = (BAOpp - mean(BAOpp)) / sd(BAOpp))
pitchers <- pitchers %>% mutate(normERA = (ERA - mean(ERA)) / sd(ERA))
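The five normalization steps all apply the same z-score transform: subtract the column mean, then divide by the column standard deviation. As a minimal sketch, here is that transform applied to several columns at once in base R. The `toy` data frame and the `zscore` helper are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for the pitchers table (values are invented for illustration)
toy <- data.frame(
  IPouts = c(30, 120, 600, 45, 210),
  BB     = c(5, 20, 60, 8, 33),
  SO     = c(10, 40, 180, 12, 70)
)

# Hypothetical helper: centre on the mean, scale by the standard deviation
zscore <- function(x) (x - mean(x)) / sd(x)

# Apply it to every column and prefix the new names with "norm"
normed <- as.data.frame(lapply(toy, zscore))
names(normed) <- paste0("norm", names(toy))
toy <- cbind(toy, normed)

# Each normalized column now has mean 0 and standard deviation 1
round(colMeans(normed), 10)
sapply(normed, sd)
```

One side effect worth knowing: when every predictor is centred like this, the intercept of an ordinary least-squares fit is simply the mean of the response.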
Finally, we can build our model.
# Build Linear Model
mylm <- lm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA, data = pitchers)

# Analyze Findings
summary(mylm)
##
## Call:
## lm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp +
##     normERA, data = pitchers)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.0234  -1.4264   0.3797   1.3818  22.3251
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0002319  0.0158333   0.015   0.9883
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.288 on 43110 degrees of freedom
## Multiple R-squared:  0.1389, Adjusted R-squared:  0.1388
## F-statistic:  1391 on 5 and 43110 DF,  p-value: < 2.2e-16
summary(mylm)$r.squared
## [1] 0.1389159
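For reference, R-squared is the share of the response's variation around its mean that the fitted values account for. The sketch below recomputes it by hand on simulated data (not the Lahman data) and checks it against what `summary()` reports. All variable names here are illustrative.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = 2)   # weak signal, lots of noise

fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)      # variation left unexplained by the model
tss <- sum((y - mean(y))^2)       # total variation around the mean
r2_manual <- 1 - rss / tss

# The hand-computed value agrees with the one summary() reports
all.equal(r2_manual, summary(fit)$r.squared)
```

Read this way, an R-squared of about 0.139 means the five predictors leave roughly 86% of the variation in NetWins unexplained.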
That last model sucked. Let’s try again with a better model.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0002319  0.0158333   0.015   0.9883
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) =  0.139   Deviance explained = 13.9%
## GCV =  10.81  Scale est. = 10.809    n = 43116
## NULL
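The layout of this output (parametric coefficients, a GCV score) resembles what `summary()` prints for a fit from `mgcv::gam`. That is an inference, since the code for this chunk is not shown. If so, the identical coefficients have a simple explanation: the formula contains no smooth terms, and a GAM without smooth terms is just the linear model again. The sketch below illustrates this on simulated data, and shows how wrapping a predictor in `s()` lets its effect bend. All names are hypothetical.

```r
library(mgcv)  # assumed package; the original chunk is not shown

set.seed(42)
n <- 500
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)   # deliberately nonlinear signal
d <- data.frame(x = x, y = y)

fit_lm     <- lm(y ~ x, data = d)
fit_linear <- gam(y ~ x, data = d)      # no s(): same fit as lm()
fit_smooth <- gam(y ~ s(x), data = d)   # s() lets the effect of x bend

# Without smooth terms the GAM reproduces the lm() coefficients
all.equal(unname(coef(fit_lm)), unname(coef(fit_linear)))

summary(fit_linear)$r.sq   # adjusted R-squared of the purely linear GAM
summary(fit_smooth)$r.sq   # adjusted R-squared once x enters through s()
```

Whether smooth terms would help on the pitching data is an open question; the sketch only shows that a GAM needs them to differ from `lm()`.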
## Building a Generalized Linear Model
That last model sucked too. Let’s try another one: a generalized linear model.
##
## Call:
## glm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp +
##     normERA, family = gaussian, data = pitchers)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -22.0234  -1.4264   0.3797   1.3818  22.3251
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0002319  0.0158333   0.015   0.9883
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 10.80891)
##
##     Null deviance: 541146  on 43115  degrees of freedom
## Residual deviance: 465972  on 43110  degrees of freedom
## AIC: 224998
##
## Number of Fisher Scoring iterations: 2
## NULL
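The GLM output matches the linear model digit for digit, and that is expected: `glm()` with `family = gaussian` uses the identity link by default, which is ordinary least squares under another name. A small sketch on simulated data (variable names are illustrative) confirms the equivalence and shows that the two deviances carry the same R-squared.

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 - 0.4 * x2 + rnorm(n, sd = 2)

fit_lm  <- lm(y ~ x1 + x2)
fit_glm <- glm(y ~ x1 + x2, family = gaussian)

# Same coefficients from both fitting routines
all.equal(coef(fit_lm), coef(fit_glm))

# For a gaussian GLM, 1 - residual deviance / null deviance is R-squared
r2_from_deviance <- 1 - deviance(fit_glm) / fit_glm$null.deviance
all.equal(r2_from_deviance, summary(fit_lm)$r.squared)

# Same arithmetic on the deviances printed above
1 - 465972 / 541146   # about 0.1389
```

A GLM only behaves differently from `lm()` once the family or link changes, for example `family = poisson` for count responses.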
Wow, that went well. Let’s see if we can figure out any kind of model that works at all, even ones that are nonlinear and completely unintuitive to normal humans.
We’re going to throw everything at it. We are inevitable.
Put on that infinity glove and snap your fingers.
Well, look at that: we snapped our fingers and explained just over half of the variation. Excellent work, team!
Clearly, this isn’t going as well as it might’ve. We decided to look at a crossplot to see what sorts of correlations exist.
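Before plotting, the pairwise correlations can also be read off as numbers with `cor()`. The sketch below uses a made-up data frame whose columns borrow names from the model; the values are simulated, with one pair built to be strongly related, and say nothing about the real pitching data.

```r
set.seed(3)
n  <- 100
ip <- rnorm(n)

# Simulated columns; only the names echo the real predictors
toy <- data.frame(
  normIPouts = ip,
  normSO     = 0.9 * ip + rnorm(n, sd = 0.3),  # built to track normIPouts
  normERA    = rnorm(n)                        # built to be unrelated
)

# Pairwise Pearson correlations, rounded for readability
cors <- round(cor(toy), 2)
cors
```

Calling `pairs(toy)` on the same data frame draws the matching grid of scatterplots. Strong correlations among predictors are worth watching for, because they make individual regression coefficients harder to interpret.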
To me, the most interesting thing was strikeouts by OBP, so I made an interactive graph.